HW 4 Missing Data & Model Selection: Data Analysis Problems

Advanced Regression (STAT 353-0)

Author

Zihan Chu

Published

February 28, 2025

Important

All students are required to complete this problem set!

library(tidyverse)
library(mice) 
library(car) 
library(MASS)

Data analysis problems

1. Exercise D20.1 (MI)

Using the United Nations social-indicators data (in UnitedNations.txt), develop a regression model for the response variable female expectation of life. Feel free to use whatever explanatory variables in the data set make sense to you and to employ variable transformations, if needed.

  a. Work initially with complete cases, and once you have an apparently satisfactory model, obtain estimates and standard errors of the regression coefficients.
Solution
united_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/UnitedNations.txt")
united_data$region <- as.factor(united_data$region)
united_data_complete <- na.omit(united_data)

# Fit a linear regression model using complete cases
model_complete <- lm(lifeFemale ~ ., united_data_complete)
summary(model_complete)

Call:
lm(formula = lifeFemale ~ ., data = united_data_complete)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.07954 -0.61680  0.03489  0.51446  1.60556 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.375e+01  6.740e+00   3.524  0.00182 ** 
regionAmerica           3.999e-01  1.129e+00   0.354  0.72643    
regionAsia             -1.335e-01  1.124e+00  -0.119  0.90643    
regionEurope            2.248e+00  1.151e+00   1.954  0.06295 .  
regionOceania          -4.413e-01  1.926e+00  -0.229  0.82081    
tfr                    -1.362e+00  5.801e-01  -2.347  0.02788 *  
contraception          -2.341e-02  2.044e-02  -1.146  0.26374    
educationMale           5.993e-01  5.222e-01   1.148  0.26294    
educationFemale        -6.592e-01  4.926e-01  -1.338  0.19388    
lifeMale                7.815e-01  8.484e-02   9.211 3.52e-09 ***
infantMortality        -1.193e-02  2.853e-02  -0.418  0.67968    
GDPperCapita            8.485e-05  5.089e-05   1.667  0.10903    
economicActivityMale    4.226e-02  4.414e-02   0.957  0.34828    
economicActivityFemale -1.883e-02  1.865e-02  -1.009  0.32341    
illiteracyMale          1.001e-01  5.422e-02   1.846  0.07773 .  
illiteracyFemale       -1.227e-01  4.840e-02  -2.535  0.01850 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.247 on 23 degrees of freedom
Multiple R-squared:  0.9906,    Adjusted R-squared:  0.9845 
F-statistic: 161.5 on 15 and 23 DF,  p-value: < 2.2e-16

Many predictors are not significant, suggesting possible multicollinearity. GDPperCapita has a very small coefficient (8.485e-05), so it might benefit from a log transformation. The residuals range from -2.08 to 1.61, which is acceptable but could likely be improved.

plot(model_complete)

vif(model_complete)
                            GVIF Df GVIF^(1/(2*Df))
region                 33.032280  4        1.548344
tfr                    18.312361  1        4.279295
contraception           4.202452  1        2.049988
educationMale          38.164269  1        6.177724
educationFemale        49.808193  1        7.057492
lifeMale               13.991749  1        3.740555
infantMortality        23.267489  1        4.823639
GDPperCapita            2.559988  1        1.599996
economicActivityMale    3.281195  1        1.811407
economicActivityFemale  2.468757  1        1.571228
illiteracyMale         18.958728  1        4.354162
illiteracyFemale       33.081818  1        5.751680

Data transformation 1

model_transformed <- lm(lifeFemale ~ tfr + lifeMale + log(GDPperCapita) +
                          economicActivityFemale + log(educationFemale) +
                          economicActivityMale + contraception,
                        data = united_data_complete)
summary(model_transformed)

Call:
lm(formula = lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + 
    economicActivityFemale + log(educationFemale) + economicActivityMale + 
    contraception, data = united_data_complete)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6189 -0.7120 -0.1288  0.7447  2.7710 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            19.054690   4.790977   3.977 0.000389 ***
tfr                    -2.056746   0.381891  -5.386 7.11e-06 ***
lifeMale                0.847174   0.059300  14.286 3.49e-15 ***
log(GDPperCapita)       0.183585   0.254628   0.721 0.476316    
economicActivityFemale  0.004541   0.018056   0.251 0.803100    
log(educationFemale)    0.466637   0.969913   0.481 0.633816    
economicActivityMale    0.016441   0.036739   0.448 0.657619    
contraception          -0.031832   0.019021  -1.674 0.104278    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.439 on 31 degrees of freedom
Multiple R-squared:  0.9831,    Adjusted R-squared:  0.9793 
F-statistic: 257.7 on 7 and 31 DF,  p-value: < 2.2e-16
plot(model_transformed)

vif(model_transformed)
                   tfr               lifeMale      log(GDPperCapita) 
              5.955934               5.129213               2.158457 
economicActivityFemale   log(educationFemale)   economicActivityMale 
              1.735643               2.627796               1.705966 
         contraception 
              2.731702 
anova(model_complete, model_transformed)
Analysis of Variance Table

Model 1: lifeFemale ~ region + tfr + contraception + educationMale + educationFemale + 
    lifeMale + infantMortality + GDPperCapita + economicActivityMale + 
    economicActivityFemale + illiteracyMale + illiteracyFemale
Model 2: lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + economicActivityFemale + 
    log(educationFemale) + economicActivityMale + contraception
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     23 35.743                              
2     31 64.200 -8   -28.456 2.2889 0.05752 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the influence diagnostics, Latvia, Indonesia, and Botswana appear as potential outliers. The residuals-vs-fitted plot shows a slight pattern, suggesting possible non-linearity, and the Q-Q plot deviates from normality in the tails. Several predictors still have elevated VIFs (tfr at 5.96 and lifeMale at 5.13 in the transformed model), so some multicollinearity remains, and I try to improve the model again. The ANOVA comparison has p = 0.05752, which is not significant at the 0.05 level: dropping the eight terms does not significantly worsen the fit, so the smaller Model 2 is preferred.

Data transformation 2

model_improved <- lm(lifeFemale ~ tfr + lifeMale + log(GDPperCapita) +
                       economicActivityFemale + log(educationFemale),
                     data = united_data_complete)
summary(model_improved)

Call:
lm(formula = lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + 
    economicActivityFemale + log(educationFemale), data = united_data_complete)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7193 -0.7889 -0.2196  0.9811  3.2810 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            18.448869   4.839185   3.812 0.000571 ***
tfr                    -1.741571   0.300306  -5.799 1.74e-06 ***
lifeMale                0.854868   0.052140  16.396  < 2e-16 ***
log(GDPperCapita)       0.148270   0.257217   0.576 0.568228    
economicActivityFemale  0.005607   0.016670   0.336 0.738727    
log(educationFemale)    0.073993   0.954033   0.078 0.938648    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.46 on 33 degrees of freedom
Multiple R-squared:  0.9815,    Adjusted R-squared:  0.9787 
F-statistic: 350.1 on 5 and 33 DF,  p-value: < 2.2e-16
plot(model_improved)

vif(model_improved)
                   tfr               lifeMale      log(GDPperCapita) 
              3.579619               3.853990               2.140758 
economicActivityFemale   log(educationFemale) 
              1.437828               2.471110 
anova(model_transformed, model_improved)
Analysis of Variance Table

Model 1: lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + economicActivityFemale + 
    log(educationFemale) + economicActivityMale + contraception
Model 2: lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + economicActivityFemale + 
    log(educationFemale)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     31 64.200                           
2     33 70.315 -2    -6.115 1.4764 0.2441

VIF values are improved (all < 4), indicating reduced multicollinearity. The p-value = 0.2441 (> 0.05) for the anova test means there’s no significant difference between Model 1 and Model 2. We can use the simpler Model 2 (removing economicActivityMale and contraception doesn’t significantly worsen the model).

For each one-unit increase in the fertility rate, female life expectancy decreases by 1.74 years, a highly significant effect (p < 0.001). For each one-year increase in male life expectancy, female life expectancy increases by 0.85 years, also highly significant (p < 0.001). GDP, female economic activity, and female education are not statistically significant. The model explains 98.15% of the variance in female life expectancy.

  b. Now redo your analysis in (a) but use multiple imputation.
Solution
set.seed(123)
imp <- mice(united_data, m = 5, method="pmm", seed=123)

 iter imp variable
  1   1  tfr  contraception  educationMale  educationFemale  lifeMale  lifeFemale  infantMortality  GDPperCapita  economicActivityMale  economicActivityFemale  illiteracyMale  illiteracyFemale
  [... identical log lines for the remaining iteration/imputation combinations (5 iterations x 5 imputations) omitted ...]
model_imp_complete <- with(imp, lm(lifeFemale ~ region + tfr + contraception + educationMale + 
       educationFemale + lifeMale + infantMortality + GDPperCapita + 
       economicActivityMale + economicActivityFemale + 
       illiteracyMale + illiteracyFemale))

pooled_complete <- pool(model_imp_complete)
summary(pooled_complete)
                     term      estimate    std.error   statistic         df
1             (Intercept)  1.996385e+01 3.181402e+00  6.27517500  42.725077
2           regionAmerica  4.426463e-01 4.961102e-01  0.89223372  49.993902
3              regionAsia  4.272759e-02 4.111295e-01  0.10392733  43.711301
4            regionEurope  1.013285e+00 4.970044e-01  2.03878472  56.151029
5           regionOceania -4.474448e-01 5.119470e-01 -0.87400597 157.035616
6                     tfr -6.983754e-01 1.626416e-01 -4.29395199  26.611063
7           contraception -2.428545e-02 1.021230e-02 -2.37806011  20.585202
8           educationMale  1.436599e-01 3.560962e-01  0.40343007   5.053972
9         educationFemale -1.311389e-01 3.738908e-01 -0.35074127   4.782813
10               lifeMale  8.547754e-01 3.761988e-02 22.72137563  68.363943
11        infantMortality -3.497512e-02 1.436239e-02 -2.43518806   9.664250
12           GDPperCapita  2.260934e-07 1.709956e-05  0.01322218  63.418593
13   economicActivityMale -1.342413e-02 2.147323e-02 -0.62515644  21.072391
14 economicActivityFemale  9.664915e-03 9.370237e-03  1.03144829  29.767437
15         illiteracyMale  4.732288e-02 2.758298e-02  1.71565504  24.656082
16       illiteracyFemale -5.867731e-02 2.354499e-02 -2.49213623  22.486989
        p.value
1  1.491394e-07
2  3.765422e-01
3  9.177024e-01
4  4.618996e-02
5  3.834496e-01
6  2.077617e-04
7  2.718266e-02
8  7.031345e-01
9  7.406993e-01
10 1.411941e-33
11 3.592168e-02
12 9.894920e-01
13 5.385801e-01
14 3.106387e-01
15 9.876715e-02
16 2.053217e-02
fit_stats <- pool.r.squared(model_imp_complete)
print(fit_stats)
          est    lo 95     hi 95       fmi
R^2 0.9847297 0.979497 0.9886347 0.1570263

lifeMale, tfr, contraception, infantMortality, illiteracyFemale, and regionEurope are significant predictors in the pooled model.
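The pooled estimates and standard errors above come from Rubin's rules, which pool() applies for us. A minimal sketch of the arithmetic, using made-up numbers (not taken from the UN fits) for one coefficient across m = 5 imputations:

```r
# Hypothetical per-imputation estimates and standard errors for one coefficient
est <- c(0.84, 0.86, 0.85, 0.87, 0.85)
se  <- c(0.040, 0.038, 0.041, 0.039, 0.040)
m   <- length(est)

qbar  <- mean(est)            # pooled point estimate: average over imputations
w     <- mean(se^2)           # within-imputation variance
b     <- var(est)             # between-imputation variance
t_var <- w + (1 + 1/m) * b    # total variance (Rubin's rules)
pooled_se <- sqrt(t_var)
c(estimate = qbar, std.error = pooled_se)
```

The between-imputation component inflates the standard error to reflect uncertainty about the missing values, which is why pooled standard errors exceed those from any single imputed dataset.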

densityplot(imp) 

Try improvement

model_imp_transformed <- with(imp, 
    lm(lifeFemale ~ region + 
       log(tfr) +             
       log(lifeMale) + 
       I(infantMortality^2) +
       log(contraception) + 
       log(illiteracyFemale) + economicActivityFemale + log(GDPperCapita)))

pooled_transformed <- pool(model_imp_transformed)
summary(pooled_transformed)
                     term      estimate    std.error    statistic         df
1             (Intercept) -1.569690e+02 7.569300e+00 -20.73758449 103.403561
2           regionAmerica  3.833838e-01 3.924632e-01   0.97686560 192.193737
3              regionAsia -6.171583e-01 3.596034e-01  -1.71621916 144.112170
4            regionEurope  1.922553e-01 4.830477e-01   0.39800485  89.122067
5           regionOceania -6.652831e-01 6.225888e-01  -1.06857536  20.653250
6                log(tfr) -2.716701e+00 4.517948e-01  -6.01313096 105.725811
7           log(lifeMale)  5.513827e+01 1.895414e+00  29.09036240  72.295930
8    I(infantMortality^2)  3.599214e-06 4.682285e-05   0.07686874  76.135919
9      log(contraception) -4.761709e-01 2.158387e-01  -2.20614191 124.006342
10  log(illiteracyFemale) -5.608013e-01 1.209680e-01  -4.63594873  13.878828
11 economicActivityFemale  1.269129e-02 6.904894e-03   1.83801315  94.814683
12      log(GDPperCapita)  3.612626e-01 1.401303e-01   2.57804700   9.204761
        p.value
1  1.255615e-38
2  3.298637e-01
3  8.827070e-02
4  6.915784e-01
5  2.975921e-01
6  2.638188e-08
7  1.205907e-41
8  9.389296e-01
9  2.921619e-02
10 3.939025e-04
11 6.919171e-02
12 2.927951e-02
fit_stats_transformed <- pool.r.squared(model_imp_transformed)
print(fit_stats_transformed)
          est   lo 95     hi 95       fmi
R^2 0.9851895 0.97991 0.9890892 0.2196999
anova(model_imp_complete, model_imp_transformed)
   test statistic df1      df2 dfcom      p.value     riv
 1 ~~ 2  228.9663  10 70.28529   191 1.744554e-49 1.11773

I transformed several predictors. From the summary, the pooled R^2 has increased slightly (0.9852 vs. 0.9847), and the p-value of the anova comparison indicates a significant difference between the original pooled model and the transformed one.

  c. Compare these results to those from the complete-case analysis. What do you conclude?
Solution

Comparing R-squared values:

r2_complete_a <- summary(model_complete)$r.squared
r2_improved_a <- summary(model_improved)$r.squared
print(r2_complete_a)
[1] 0.9905933
print(r2_improved_a)
[1] 0.9814951

The R-squared values are similar, but the multiple-imputation analysis uses all of the countries rather than only the complete cases, so its pooled standard errors are generally smaller, and several predictors that were non-significant in the complete-case fit (e.g., contraception and infantMortality) become significant. I conclude that multiple imputation recovers information that listwise deletion discards and is the preferable analysis here.

2. Exercise D20.3 (Selection)

Long (1997) reports a regression in which the response variable is the prestige of the academic departments where PhDs in biochemistry find their first jobs. The data are in the file Long-PhDs.txt.

Prestige is measured on a scale that runs from 1.00 to 5.00, and is unavailable for departments without graduate programs and for departments with ratings below 1.00. The explanatory variables include a dummy regressor for gender; the prestige of the department in which the individual obtained his or her PhD; the number of citations received by the individual's mentor; a dummy regressor coding whether or not the individual held a fellowship; the number of articles published by the individual; and the number of citations received by the individual.

Estimate the regression of prestige of first job on the other variables in three ways:

  a. code all of the missing values as 1.00 and perform an OLS regression;
Solution
phd_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/Long-PhDs.txt")
phd_data$gender <- factor(phd_data$gender)
phd_data$fellowship <- factor(phd_data$fellowship)
missing_count <- colSums(is.na(phd_data))
print("Missing values per column:")
[1] "Missing values per column:"
print(missing_count)
       job     gender        phd     mentor fellowship   articles  citations 
        99          0          0          0          0          0          0 
phd_data_a <- phd_data
phd_data_a$job[is.na(phd_data_a$job)] <- 1.00
model_a <- lm(job ~ gender + phd + mentor + fellowship + articles + citations, 
              data = phd_data_a)
summary(model_a)

Call:
lm(formula = job ~ gender + phd + mentor + fellowship + articles + 
    citations, data = phd_data_a)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.80152 -0.70105 -0.03821  0.68599  2.17601 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9279903  0.1641470   5.653 2.99e-08 ***
gendermale    0.1391939  0.0902344   1.543   0.1237    
phd           0.2726826  0.0493183   5.529 5.81e-08 ***
mentor        0.0011867  0.0007012   1.692   0.0913 .  
fellowshipyes 0.2341384  0.0948206   2.469   0.0140 *  
articles      0.0228011  0.0288843   0.789   0.4303    
citations     0.0044788  0.0019687   2.275   0.0234 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8717 on 401 degrees of freedom
Multiple R-squared:  0.2101,    Adjusted R-squared:  0.1983 
F-statistic: 17.78 on 6 and 401 DF,  p-value: < 2.2e-16

The overall model is statistically significant (F-statistic: 17.78, p-value < 2.2e-16) and explains about 21% of the variance in job prestige. There is a significant positive relationship between the prestige of the department where individuals obtained their PhD and their first job prestige. For each one-point increase in PhD department prestige, first job prestige increases by 0.273 points on average, holding other factors constant.

Fellowship Status is significant in the 0.01 level. Individuals with fellowships tend to secure first jobs in departments with prestige ratings about 0.23 points higher than those without fellowships.

The number of citations received by the individual has a small but significant positive effect on job prestige. Each additional citation is associated with a 0.004 point increase in first job prestige.

The citations received by an individual’s mentor show a marginally significant positive effect, suggesting some benefit from having a well-cited mentor.

Gender: Being male is associated with a 0.139 point increase in job prestige compared to being female, but this effect is not statistically significant (p = 0.124).

The number of articles published shows a positive but non-significant relationship with job prestige (p = 0.430).

  b. treat the missing values as truncated at 1.00 and employ Heckman's selection-regression model;
Solution
library(sampleSelection)
phd_data_b <- phd_data
phd_data_b$observed <- ifelse(!is.na(phd_data_b$job), 1, 0)

phd_data_b$job_trunc <- phd_data_b$job
phd_data_b$job_trunc[is.na(phd_data_b$job_trunc)] <- 1.00

heckman_model <- selection(
  selection = observed ~ gender + phd + mentor + fellowship + articles + citations,
  outcome = job_trunc ~ gender + phd + mentor + fellowship + articles + citations,
  data = phd_data_b,
  method = "2step"
)
summary(heckman_model)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
408 observations (99 censored and 309 observed)
17 free parameters (df = 392)
Probit selection equation:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)   -0.413907   0.262790  -1.575  0.11605   
gendermale     0.453428   0.147240   3.080  0.00222 **
phd            0.128267   0.080552   1.592  0.11211   
mentor         0.001035   0.001332   0.777  0.43786   
fellowshipyes  0.259776   0.155315   1.673  0.09521 . 
articles       0.052669   0.061404   0.858  0.39156   
citations      0.010287   0.005373   1.914  0.05629 . 
Outcome equation:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.3932373  0.6480052   2.150   0.0322 *  
gendermale    -0.0758877  0.1574188  -0.482   0.6300    
phd            0.3050634  0.0611690   4.987 9.22e-07 ***
mentor         0.0008408  0.0006884   1.221   0.2227    
fellowshipyes  0.1603983  0.1361823   1.178   0.2396    
articles       0.0084220  0.0279506   0.301   0.7633    
citations      0.0023844  0.0021991   1.084   0.2789    
Multiple R-Squared:0.2014,  Adjusted R-Squared:0.1829
   Error terms:
              Estimate Std. Error t value Pr(>|t|)
invMillsRatio   0.1332     0.6905   0.193    0.847
sigma           0.7001         NA      NA       NA
rho             0.1903         NA      NA       NA
--------------------------------------------

For the selection equation: being male significantly increases the probability of having an observed job-prestige value (p < 0.01), suggesting that male PhDs are more likely to find jobs in departments where prestige ratings are available. Having a fellowship shows a marginally significant positive effect on having an observed prestige value, indicating that fellowship holders may be more likely to secure positions in departments with available prestige ratings. Individual citations also have a marginally significant positive effect (β = 0.010, p < 0.10), suggesting that more-cited individuals are somewhat more likely to have jobs with measurable prestige. PhD department prestige, mentor citations, and number of articles do not significantly predict whether job prestige is observed.

For the outcome equation: only phd has a significant effect; the other predictors do not show significant impacts on job prestige. The inverse Mills ratio is not statistically significant (p = 0.847), suggesting that selection bias may not be a major concern in this model, and the low rho (0.1903) likewise indicates a weak correlation between the error terms of the selection and outcome equations.
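For intuition about the invMillsRatio term: Heckman's second step augments the outcome regression with the inverse Mills ratio, λ(z) = φ(z)/Φ(z), evaluated at each observation's probit selection index. A small self-contained sketch (not using the PhD data):

```r
# Inverse Mills ratio: the selection-correction regressor in Heckman's 2nd step
inv_mills <- function(z) dnorm(z) / pnorm(z)

inv_mills(0)   # ~0.798: sizable correction when selection probability is 50%
inv_mills(2)   # ~0.055: near zero when selection is almost certain
```

Since the estimated coefficient on this term (0.133) is small relative to its standard error here, including it barely shifts the outcome estimates, consistent with the weak estimated rho.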

  c. treat the missing values as censored and fit the Tobit model.
Solution
library(AER)
phd_data_c <- phd_data
phd_data_c$censored <- is.na(phd_data_c$job)
phd_data_c$job_censored <- phd_data_c$job
phd_data_c$job_censored[is.na(phd_data_c$job_censored)] <- 1.00
tobit_model <- tobit(job_censored ~ gender + phd + mentor + fellowship + articles + citations, 
                    left = 1.00, 
                    data = phd_data_c)
summary(tobit_model)

Call:
tobit(formula = job_censored ~ gender + phd + mentor + fellowship + 
    articles + citations, left = 1, data = phd_data_c)

Observations:
         Total  Left-censored     Uncensored Right-censored 
           408             99            309              0 

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.4485575  0.2178915   2.059   0.0395 *  
gendermale    0.2368486  0.1165852   2.032   0.0422 *  
phd           0.3225846  0.0639229   5.046  4.5e-07 ***
mentor        0.0013436  0.0008876   1.514   0.1301    
fellowshipyes 0.3252657  0.1224575   2.656   0.0079 ** 
articles      0.0339053  0.0365017   0.929   0.3530    
citations     0.0050900  0.0024752   2.056   0.0397 *  
Log(scale)    0.0836396  0.0428043   1.954   0.0507 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Scale: 1.087 

Gaussian distribution
Number of Newton-Raphson Iterations: 4 
Log-likelihood: -560.3 on 8 Df
Wald-statistic: 96.21 on 6 Df, p-value: < 2.22e-16 

The model is highly significant (Wald-statistic: 96.21, p-value < 2.22e-16). The prestige of the department where individuals obtained their PhD is the strongest predictor of first job prestige. For each one-point increase in PhD department prestige, the expected job prestige increases by about 0.32 points, holding other factors constant.

Having held a fellowship significantly increases expected job prestige. Individuals with fellowships tend to secure first jobs in departments with prestige ratings about 0.33 points higher than those without fellowships.

Being male is associated with higher job prestige (β = 0.237, p < 0.05). Male PhDs tend to secure first jobs in departments with prestige ratings about 0.24 points higher than female PhDs, after controlling for other factors.

The number of citations received by the individual has a small but significant positive effect. Each additional citation is associated with a 0.005 point increase in expected job prestige.

Neither mentor citations nor the number of articles published show statistically significant effects on job prestige, though both have positive coefficients.

  d. Compare the estimates and coefficient standard errors obtained by the three approaches. Which of these approaches makes the most substantive sense?
Solution
ols_results <- summary(model_a)$coefficients
heckman_results <- summary(heckman_model)$estimate
tobit_results <- summary(tobit_model)$coefficients

common_vars <- Reduce(intersect, list(rownames(ols_results), rownames(heckman_results), rownames(tobit_results)))
ols_subset <- ols_results[common_vars, , drop=FALSE]
heckman_subset <- heckman_results[common_vars, , drop=FALSE]
tobit_subset <- tobit_results[common_vars, , drop=FALSE]

comparison_df <- data.frame(
  Variable = common_vars,
  OLS_Coef = ols_subset[, 1],
  OLS_StdErr = ols_subset[, 2],
  Heckman_Coef = heckman_subset[, 1],
  Heckman_StdErr = heckman_subset[, 2],
  Tobit_Coef = tobit_subset[, 1],
  Tobit_StdErr = tobit_subset[, 2]
)

print(comparison_df)
                   Variable    OLS_Coef   OLS_StdErr Heckman_Coef
(Intercept)     (Intercept) 0.927990303 0.1641469701 -0.413907492
gendermale       gendermale 0.139193935 0.0902344238  0.453428161
phd                     phd 0.272682644 0.0493183451  0.128267082
mentor               mentor 0.001186708 0.0007011642  0.001034672
fellowshipyes fellowshipyes 0.234138434 0.0948206487  0.259775621
articles           articles 0.022801124 0.0288842707  0.052668574
citations         citations 0.004478843 0.0019686646  0.010287045
              Heckman_StdErr  Tobit_Coef Tobit_StdErr
(Intercept)      0.262790471 0.448557544 0.2178915284
gendermale       0.147240499 0.236848569 0.1165851688
phd              0.080551984 0.322584618 0.0639228959
mentor           0.001332314 0.001343628 0.0008875672
fellowshipyes    0.155315145 0.325265728 0.1224574981
articles         0.061404494 0.033905332 0.0365017207
citations        0.005373294 0.005090006 0.0024751689

phd is consistently significant across all three approaches, suggesting that PhD department prestige is a robust predictor of first-job prestige regardless of how the missing data are handled. Gender effects are strikingly inconsistent: gender has a large positive coefficient (0.453) in the Heckman selection equation, is non-significant in OLS (0.139), and shows a stronger effect in the Tobit model (0.237), which suggests gender may influence the selection process and the outcome differently. Fellowship status has consistent positive effects in OLS (0.234) and Tobit (0.325) but shows up primarily in the selection equation of the Heckman model (0.260), suggesting it affects whether one gets a job with measurable prestige rather than the prestige level itself.

The Tobit model makes the most substantive sense for several reasons: 1. The Heckman model reveals that gender and citations affect whether job prestige is observed, indicating a selection process that Tobit can handle more parsimoniously. 2. The Tobit model produces estimates that generally align with theoretical expectations about academic job markets, where PhD prestige, fellowship status, gender, and citations all matter. 3. Tobit generally has more reasonable standard errors than Heckman while accounting for the censoring mechanism (unlike OLS).
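The case against plain OLS in (a) can also be seen in a small simulation: coding censored responses at the limit flattens the regression line. A self-contained sketch with simulated data (true slope 0.5, censoring at 1.00 as in the prestige data):

```r
set.seed(42)
n <- 5000
x <- rnorm(n)
y_star <- 1.5 + 0.5 * x + rnorm(n)  # latent outcome
y_cens <- pmax(y_star, 1.00)        # values below 1.00 recorded as 1.00
coef(lm(y_cens ~ x))["x"]           # attenuated well below the true 0.5
```

A Tobit fit to y_cens would recover a slope near 0.5 by modeling the censoring mechanism explicitly, which is exactly what it does for the roughly 24% of PhDs whose prestige is coded at 1.00.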

3. Exercise (Bootstrap)

We will now consider the Boston housing dataset from the MASS package.

library(MASS)
library(boot)
data(Boston, package = "MASS")

Run ?Boston in the console to see the codebook.

  a. Provide an estimate of the population mean of medv. Call this estimate \(\hat{\mu}\).
Solution
summary(Boston$medv)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   17.02   21.20   22.53   25.00   50.00 
mu_hat <- mean(Boston$medv)
mu_hat 
[1] 22.53281
  b. What is the formula for the standard error of an estimate of the mean? Use this to provide an estimate of the standard error of \(\hat{\mu}\) in (a).
Solution

The formula is \(SE(\hat{\mu}) = s / \sqrt{n}\), where \(s\) is the sample standard deviation and \(n\) is the number of observations.

# sample standard deviation
s <- sd(Boston$medv)
# sample size
n <- length(Boston$medv)

# standard error of the mean
SE_mu_hat <- s / sqrt(n)
SE_mu_hat  
[1] 0.4088611
  c. Estimate this standard error using the bootstrap. How does this compare to the answer from (b)?
Solution
mean_func <- function(data, indices) {
  return(mean(data[indices]))
}
set.seed(123) 
bootstrap_results <- boot(data = Boston$medv, 
                         statistic = mean_func, 
                         R = 10000)
print(bootstrap_results)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = mean_func, R = 10000)


Bootstrap Statistics :
    original       bias    std. error
t1* 22.53281 -0.002135435   0.4088195
# The bootstrap standard error
bootstrap_se <- sd(bootstrap_results$t)
cat("Classical estimate of standard error of mean:", SE_mu_hat, "\n")
Classical estimate of standard error of mean: 0.4088611 
cat("Bootstrap estimate of standard error of mean:", bootstrap_se, "\n")
Bootstrap estimate of standard error of mean: 0.4088195 

The bootstrap estimate (0.40882) is nearly identical to the classical estimate (0.40886). This is exactly what we should expect: the bootstrap approximates the same sampling distribution of the mean that the formula \(s/\sqrt{n}\) describes analytically, so the tiny difference is just Monte Carlo resampling noise, not a genuine reduction in error.
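As a sanity check, the machinery in the boot package reduces to a simple resampling loop. A hand-rolled version in base R (on a simulated stand-in vector, not the Boston data) gives essentially the same answer as \(s/\sqrt{n}\):

```r
# Hand-rolled nonparametric bootstrap for the SE of the mean, base R only
set.seed(123)
x <- rnorm(200, mean = 22.5, sd = 9)   # hypothetical stand-in for Boston$medv
R <- 5000
boot_means <- replicate(R, mean(sample(x, replace = TRUE)))

se_classical <- sd(x) / sqrt(length(x))
se_boot <- sd(boot_means)              # SD of the resampled means
c(classical = se_classical, bootstrap = se_boot)
```

The two estimates agree up to Monte Carlo noise; neither is systematically smaller than the other.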

  d. Provide an estimate of \(\hat{\mu}_{med}\), the median value of medv in the population.
Solution
median_hat <- median(Boston$medv)
median_hat
[1] 21.2
  e. Estimate the standard error of \(\hat{\mu}_{med}\). Notice that there is no simple formula to do this, so instead use the bootstrap. Comment on your findings.
Solution
median_func <- function(data, indices) {
  return(median(data[indices]))
}
set.seed(123) 
bootstrap_results <- boot(data = Boston$medv, 
                         statistic = median_func, 
                         R = 10000)
median_estimate <- bootstrap_results$t0
median_se <- sd(bootstrap_results$t)
cat("Bootstrap estimate of standard error of median:", median_se, "\n")
Bootstrap estimate of standard error of median: 0.3773774 
hist(Boston$medv, breaks = 15)

The histogram of medv shows a right-skewed distribution with a spike at 50 (where medv appears to be top-coded), so there are high-value outliers. The bootstrap standard error of the median (0.377) is smaller than the standard error of the mean (0.409). This is consistent with the skew: the median is robust to the long right tail, while the mean is pulled around by it, inflating its variability. For a symmetric, light-tailed distribution we would expect the opposite ordering; under a normal population the median's standard error is actually about 25% larger than the mean's.
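The robustness claim can be checked on simulated right-skewed data, where the median's bootstrap SE comes out below the mean's (a sketch using a hypothetical lognormal sample, not the Boston data):

```r
# Bootstrap SEs of the mean vs. the median on a right-skewed sample, base R
set.seed(42)
x <- rlnorm(500, meanlog = 3, sdlog = 1)   # heavy right tail, like medv
R <- 5000
boot_se <- function(stat) sd(replicate(R, stat(sample(x, replace = TRUE))))

se_mean   <- boot_se(mean)
se_median <- boot_se(median)
c(mean_SE = se_mean, median_SE = se_median)
```

With a heavy enough right tail, the mean's SE exceeds the median's, matching the pattern seen with medv; for mild skew or symmetric data the ordering can flip.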

4. Exercise D22.1 (Model Selection)

The data file BaseballPitchers.txt contains salary and performance data for major-league baseball pitchers at the start of the 1987 season. The data are analogous to those for baseball hitters used as an example in the chapter. Be sure to explore the data and think about variables to use as predictors before specifying candidate models.

  a. Employing one or more of the methods of model selection described in the text, develop a regression model to predict pitchers’ salaries.
Solution
baseball_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/BaseballPitchers.txt", header=TRUE)
baseball_clean <- baseball_data %>% filter(!is.na(salary))
baseball_clean$league86 <- as.factor(baseball_clean$league86)
baseball_clean$league87 <- as.factor(baseball_clean$league87)
baseball_clean$team86 <- as.factor(baseball_clean$team86)
baseball_clean$team87 <- as.factor(baseball_clean$team87)
full_model <- lm(salary ~ league86 + team86 + W86 + L86 + ERA86 + G86 + IP86 + SV86 + years + 
                  careerW + careerL + careerERA + careerG + careerIP + careerSV + league87 + team87,
                data = baseball_clean)
null_model <- lm(salary ~ 1, data = baseball_clean)

We compare backward and stepwise selection under both the AIC and BIC criteria.

# Backward selection using AIC
backward_aic <- step(full_model, direction = "backward", trace = FALSE)
summary(backward_aic)$adj.r.squared
[1] 0.4880959
AIC_backward <- AIC(backward_aic)
AIC_backward
[1] 2494.15
# Backward selection using BIC
backward_bic <- stepAIC(full_model, direction="backward", trace = FALSE, k=log(nrow(baseball_clean)))
summary(backward_bic)$adj.r.squared
[1] 0.3810568
BIC_backward <- BIC(backward_bic)
BIC_backward
[1] 2520.265
# Stepwise selection using AIC
stepwise_aic <- step(null_model, 
                      scope = list(lower = formula(null_model), 
                                  upper = formula(full_model)),
                      direction = "both",trace = FALSE)
summary(stepwise_aic)$adj.r.squared
[1] 0.4879433
AIC_stepwise <- AIC(stepwise_aic)
AIC_stepwise
[1] 2493.403
# Stepwise selection using BIC
stepwise_bic <- stepAIC(full_model, direction="both", trace = FALSE, k=log(nrow(baseball_clean)))
summary(stepwise_bic)$adj.r.squared
[1] 0.397887
BIC_stepwise <- BIC(stepwise_bic)
BIC_stepwise
[1] 2519.557

The AIC-selected models explain more variance (higher adjusted \(R^2\)) but are more complex; the BIC-selected models are more parsimonious but explain less. Under both criteria, stepwise selection performed slightly better than backward selection. The differences are small, but they suggest that some variables dropped early in backward selection can be useful in combination with others.
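The parsimony difference follows directly from the penalties: \(AIC = 2k - 2\log L\) while \(BIC = k\log(n) - 2\log L\), so whenever \(n > e^2 \approx 7.4\), BIC charges more per parameter and favors smaller models. A quick base-R check on a toy lm fit (hypothetical data, with n matching this data set's 176 rows):

```r
# AIC vs. BIC computed by hand and compared against the built-ins, base R
set.seed(1)
d <- data.frame(y = rnorm(176), x1 = rnorm(176), x2 = rnorm(176))
fit <- lm(y ~ x1 + x2, data = d)

k  <- attr(logLik(fit), "df")   # parameters counted by logLik (incl. sigma)
n  <- nobs(fit)
ll <- as.numeric(logLik(fit))

aic_manual <- 2 * k - 2 * ll
bic_manual <- log(n) * k - 2 * ll
c(AIC = AIC(fit), AIC_manual = aic_manual,
  BIC = BIC(fit), BIC_manual = bic_manual)
```

With n = 176, log(n) ≈ 5.17, so BIC's per-parameter penalty is roughly 2.6 times AIC's, which is why the BIC-selected models here are smaller.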

The final model (chosen by stepwise AIC):

summary(stepwise_aic)

Call:
lm(formula = salary ~ years + careerERA + IP86 + team87 + careerSV + 
    league87, data = baseball_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-609.2 -158.1  -26.6  125.1  871.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   57.6001   226.9173   0.254 0.799976    
years         34.2776     5.5267   6.202 5.36e-09 ***
careerERA   -113.6827    41.9487  -2.710 0.007528 ** 
IP86           1.9527     0.4111   4.750 4.78e-06 ***
team87Bal.   443.1004   172.2881   2.572 0.011107 *  
team87Bos.   366.8161   172.5102   2.126 0.035145 *  
team87Cal.   253.6172   194.5625   1.304 0.194433    
team87Chi.   489.7911   129.4280   3.784 0.000224 ***
team87Cin.   160.9367   144.1991   1.116 0.266212    
team87Cle.   -44.2065   178.0179  -0.248 0.804229    
team87Det.   548.6312   173.3258   3.165 0.001883 ** 
team87Hou.    34.7395   134.1054   0.259 0.795962    
team87K.C.   385.5194   174.7426   2.206 0.028921 *  
team87L.A.   281.2451   133.5009   2.107 0.036843 *  
team87Mil.   217.7212   186.0869   1.170 0.243895    
team87Min.   380.3508   173.7509   2.189 0.030170 *  
team87Mon.  -148.4768   144.1560  -1.030 0.304714    
team87N.Y.   255.4010   133.5029   1.913 0.057683 .  
team87Oak.   292.4248   163.7470   1.786 0.076188 .  
team87Phi.    43.6651   138.4021   0.315 0.752834    
team87Pit.    57.3938   139.5902   0.411 0.681554    
team87S.D.   147.4488   138.6138   1.064 0.289192    
team87S.F.     6.6883   144.6113   0.046 0.963174    
team87Sea.   199.1914   187.0554   1.065 0.288677    
team87St.L.   82.6906   145.6265   0.568 0.571019    
team87Tex.   197.2861   169.4256   1.164 0.246132    
team87Tor.   310.1011   174.6704   1.775 0.077909 .  
careerSV       1.3671     0.6250   2.187 0.030311 *  
league87N    207.4445   105.3955   1.968 0.050921 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 266.2 on 147 degrees of freedom
Multiple R-squared:  0.5699,    Adjusted R-squared:  0.4879 
F-statistic: 6.956 on 28 and 147 DF,  p-value: 8.51e-16
  b. How successful is the model in predicting salaries? Does the model make substantive sense?
Solution

First, check the model for multicollinearity using variance inflation factors:

library(car)
vif_results <- vif(stepwise_aic)
print(vif_results)
               GVIF Df GVIF^(1/(2*Df))
years      1.567258  1        1.251902
careerERA  1.668797  1        1.291819
IP86       1.655416  1        1.286630
team87    12.336876 23        1.056141
careerSV   1.914102  1        1.383511
league87   6.866400  1        2.620382

Because team87 and league87 are multi-df factors, the relevant column is GVIF^(1/(2*Df)), which is on the scale of \(\sqrt{VIF}\). The continuous predictors and team87 are all well below the usual concern threshold of \(\sqrt{5} \approx 2.24\); league87 is the only borderline case (2.62, equivalent to a VIF of about 6.9), most likely because league membership is largely determined by team. Overall, multicollinearity is not a serious problem for this model.
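For a single continuous predictor, \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on all the others; car::vif generalizes this to the GVIF for multi-df factors like team87. A base-R sketch of the plain-VIF computation on hypothetical correlated data:

```r
# VIF by hand: regress each predictor on the others, then VIF = 1 / (1 - R^2)
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others
X  <- cbind(x1, x2, x3)

vif_by_hand <- function(X, j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(1:3, function(j) vif_by_hand(X, j))
vifs
```

The correlated pair x1 and x2 show inflated VIFs while x3 stays near 1, which is the same diagnostic logic `vif()` applies to the salary model above.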

Next, assess out-of-sample performance with 10-fold cross-validation:

library(caret)
cv_results <- train(salary ~ years + careerERA + IP86 + team87 + careerSV + league87,
                    data = baseball_clean,
                    method = "lm",
                    trControl = trainControl(method = "cv", number = 10))
print(cv_results)
Linear Regression 

176 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 157, 160, 157, 160, 159, 158, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  287.4584  0.4387928  221.0161

Tuning parameter 'intercept' was held constant at a value of TRUE

The model explains about 44% of the variance in pitcher salaries on held-out data (cross-validated \(R^2\) = 0.439), compared with the in-sample adjusted \(R^2\) of 0.488. MAE (221.0) is well below RMSE (287.5), indicating that a few large errors inflate the RMSE.

The roughly 10% relative drop from the in-sample adjusted \(R^2\) (0.488) to the cross-validated \(R^2\) (0.439) indicates some overfitting, but nothing severe.

The cross-validation RMSE (287.5) is likewise higher than the residual standard error of the selected model (266.2), an increase of about 8% in prediction error on new data.

In practical terms, salaries in this data set appear to be recorded in thousands of dollars, so an RMSE of roughly 287 corresponds to a typical prediction error of about $287,000. Relative to typical pitcher salaries that is a substantial percentage error: the model captures broad patterns (experience, workload, team) but is far from precise for individual players.
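The caret result can be reproduced conceptually with a hand-rolled 10-fold loop; this sketch uses simulated salary-scale data and base R only (fold assignment and RMSE computed explicitly):

```r
# Hand-rolled 10-fold cross-validated RMSE for a linear model, base R
set.seed(123)
n <- 176
d <- data.frame(x = rnorm(n))
d$y <- 50 + 30 * d$x + rnorm(n, sd = 250)   # noisy, salary-like scale

folds <- sample(rep(1:10, length.out = n))  # random fold assignment
errs <- unlist(lapply(1:10, function(k) {
  fit  <- lm(y ~ x, data = d[folds != k, ])          # train on 9 folds
  pred <- predict(fit, newdata = d[folds == k, ])    # predict held-out fold
  d$y[folds == k] - pred                             # out-of-fold residuals
}))
cv_rmse <- sqrt(mean(errs^2))
cv_rmse
```

Because every residual is computed on data the fit never saw, `cv_rmse` is an honest estimate of out-of-sample error, which is why it runs higher than the in-sample residual standard error.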